File compression techniques
have been used in utilities such
as PKZIP for years. They've
also been commonly em-
ployed in tape backup systems
and, more recently, utilities
such as Stacker have made it
possible to compress an entire
hard disk and perform decom-
pression on the fly.

How do such products accomplish 
their magic? In this first of a two-part Lab 
Notes, I'll discuss the bases of com-
pression systems such as these. In the
next issue I'll address the compression
methods used to store video and sound
information. This division corresponds to 
the two broad categories into which com- 
pression techniques fall: lossless and 
lossy. 

LOSSLESS VS. LOSSY A lossless system 
is one that can produce an exact recon- 
struction of the original file upon decom- 
pression. A lossy system — where the 
compression factor is generally higher — 
restores only a close approximation of 
the original. 

The appropriate method depends on 
the type of file you want to compress. A 
few changed bytes in an executable pro- 
gram will likely cause a crash. If a spread- 
sheet file is altered by a few bytes, the 
best result you can hope for is that the 
program will reject the file as corrupt; in 
the worst-case scenario, the data itself 
would be changed. And of course it 
would hardly do if, by compressing and 
decompressing a memo, you changed the 
spelling of your boss's name! 

On the other hand, given a 24-bit true-
color file, a few changed pixels wouldn't
make a noticeable difference. A digital
sound file that loses the extreme high fre- 
quencies (those above 20,000 Hz) will 
sound the same to all but your dog. As 
a rule, files that require lossless compres-
sion are composed of entered data, while
files that permit the use of lossy methods 
are made up of sampled data. 

This Lab Notes will provide an over- 
view of lossless compression. (For a de- 
tailed explanation of how the compres- 
sion utility PKZIP works, see the May 25, 
1993, Tutor column "Understanding 
Data Compression" on this topic.) Next 
time I'll talk about lossy compression 
methods, and will also describe some of 
the third-party libraries available to pro- 
grammers who want to implement com- 
pression in their applications. 

YOU CAN'T COMPRESS EVERYTHING Not
all files can be compressed. Moreover,
any lossless algorithm that can decrease
the size of some files must increase the
size of other files. These two basic princi-
ples form a good starting point from
which to discuss compression.

How do compression
utilities shrink files
to half their original size
and then reliably restore
them? The trick is taking
advantage of patterns in the
files we want to compress.

If every possible file could be com-
pressed, then logically it would be possi-
ble to successively reduce all files to 0
bytes. Clearly, a 0-byte file cannot con-
tain the information needed to restore
the original, nor can it become anything
but larger.

Now let's consider files of one byte. 
Only the 0-byte file is smaller and, as
noted, cannot contain the information
necessary for decompression. Also, a 
0-byte file already represents itself. For 
lossless compression, the 0-byte file can't 
represent both the uncompressed 0-byte 
file and a compressed 1-byte file. If there 
isn't a unique compressed file for each 
original file, then the original can't be re- 
stored through decompression. So 1-byte 
files can't be compressed, either. 

Suppose we expand the scope of our 
putative compression algorithm to en- 
compass all the theoretically possible 
files of 2 bytes or less. If each 1-byte file
(and the 0-byte file) must be allowed to
retain its size, there aren't any smaller
files available into which we could map 
a 2-byte file. As we consider larger and 
larger files, the problem remains. 

Since 1 byte can have 256 possible val-
ues (2^8), there can be 256 unique 1-byte
files. Adding these to the one possible
0-byte file yields 257 distinct entities
smaller than 2 bytes. So at most, 257 of
the 65,536 (2^16) possible 2-byte files could
be compressed to smaller files. But to do
this, we would have to map the 1-byte and 
0-byte files into the empty 2-byte slots left 
from the compressed 2-byte files. That is, 
to reduce the size of some 2-byte files, 
the 0-byte and 1-byte files must be al- 
lowed to grow. In this case we've done 
a simple swap, but the principle holds 
true for any compression algorithm: If 
you don't allow some files to become 
larger, there will be no smaller files avail- 
able into which larger files can be 
mapped. 
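
The counting argument can be checked with a few lines of code. The Python sketch below is illustrative only (the function names are mine, not part of any compression utility); it counts the files of each length and the files that are shorter, which are the only targets a shrinking file could be mapped to.

```python
def files_of_length(n: int) -> int:
    """Number of distinct files that are exactly n bytes long."""
    return 256 ** n


def files_shorter_than(n: int) -> int:
    """Number of distinct files with fewer than n bytes, including the 0-byte file."""
    return sum(files_of_length(k) for k in range(n))


for n in (1, 2, 3, 4):
    total = files_of_length(n)
    smaller = files_shorter_than(n)
    # Only `smaller` of the `total` files could possibly map to something shorter.
    print(f"{n}-byte files: {total:,}  shorter files: {smaller:,}  "
          f"best possible fraction: {smaller / total:.4%}")
```

For 2-byte files this reports the 257 and 65,536 quoted above, and the fraction stays just under 0.4 percent no matter how large the files get.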

If you ponder this problem a while, 
you may come up with a promising solu-
tion. Accepting that for some files to
shrink, other files will have to be allowed
to grow in size, you start with an algo- 
rithm that does just that. Then you go a 
step further. If the algorithm shrinks the 
file, you tack a 0-bit to the start of the
compressed file. If the original algorithm
makes the file larger, you tack on a 1-bit 
that is followed by the original file, 
unchanged. Now you have a method that 
will indeed sometimes increase size, but 
never by more than one bit! 
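
Here is a minimal Python sketch of that flag trick, under two assumptions of my own: the flag is a whole byte rather than a single bit (so the worst case grows by one byte, not one bit), and Python's standard zlib module stands in for "the original algorithm." The names pack and unpack are hypothetical.

```python
import os
import zlib

COMPRESSED = b"\x00"  # flag: the payload is the compressor's output
STORED = b"\x01"      # flag: the payload is the original file, unchanged


def pack(data: bytes) -> bytes:
    """Compress if that helps; otherwise store the original behind a flag."""
    candidate = zlib.compress(data, 9)
    if len(candidate) < len(data):
        return COMPRESSED + candidate
    return STORED + data


def unpack(blob: bytes) -> bytes:
    """Undo pack() by looking at the flag byte."""
    flag, payload = blob[:1], blob[1:]
    if flag == COMPRESSED:
        return zlib.decompress(payload)
    return payload


text = b"abc" * 1000          # highly repetitive, shrinks a lot
noise = os.urandom(1000)      # random bytes, will be stored as-is
print(len(text), len(pack(text)))
print(len(noise), len(pack(noise)))
```

Whatever the input, the packed result is never more than one byte longer than the original.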

This trick is useful; indeed, every seri- 
ous compression method allows the 
straight storing of the original, un- 
changed file, listing the method as stored 
in the internal directory of the com- 
pressed file. But simply storing the files 
that would otherwise grow doesn't help 
much, since only a small fraction of all 
files can be compressed even a little. 

We've already seen that only a small 
fraction of all possible 2-byte files can be 
compressed, even if we allow the smaller 
files to grow. There simply aren't enough 
smaller files. Since there are 256 values 
for a byte, the number of possible files 
of n+1 bytes is 256 times the number of
possible files of n bytes. Thus it's never
possible to compress more than about 1/256
of the files of a given size even by 1 byte!

We've seen, then, that there is simply 
no algorithm that can shrink all possible 
files of a given size. Moreover, only a tiny 
fraction of all possible files can be shrunk 
by even 1 byte, and the fraction becomes 
smaller as the compression factor in- 
creases. Fortunately, however, we don't 
need an algorithm that will compress all 
theoretically possible files. 

A key point that underlies all com- 
pression is that the files you and I actually 
have on our disks are far from random. 
Most theoretically possible files are files 
we'd have no interest in at all, so we can 
safely let these grow and shrink only the 
files we care about. The secret of com- 
pression is to find an algorithm that un- 
derstands how the files we normally use 
depart from randomness, and selectively 
shrinks just those. 

NONRANDOM IS BETTER You can prove 
this to yourself by creating the short Pas- 
cal program listed in Figure 1. Running 
the program will produce a file of 10,000 
random bytes. If you put this file through 
PKZIP, you'll wind up with a file of 
10,112 bytes! (The ZIP overhead for a di- 
rectory entry is 112 bytes.) Plainly, you've 
achieved no compression. 

One element of nonrandomness is the 
uneven frequency with which various 
characters occur in a language. For exam-
ple, in English, the letter e is very com-
mon, but few words contain the letter x.
Patterns such as these can be exploited 
by using variable-length coding, where 
the length of each character's code de- 
pends on the frequency of the character. 
A parable will illustrate. 

The story is told of a fierce tribe of 
programmers, the Foobari. Their lan- 
guage has four letters: a, e, i, and o. At 
the dawn of the computer age, their stan- 
dards group got together and agreed on 
a set of two-bit codes for their language. 
This FSCII code was, in binary, 

a=00 e=01 i=10 o=11

A young entrepreneur, Foo Katz, de- 
termined a typical program written in 
Foobari was 50 percent a's, 25 percent e's, 
and 12.5 percent each i's and o's. Foo 
Katz realized if he instead used the codes 

a=0 e=10 i=110 o=111

he could code a typical file in fewer bits. 
For 25 percent of the time he needed 3 
bits instead of FSCII's 2, but 50 percent 
of the time, he only needed 1. So the aver- 
age number of bits per character was 1.75 
rather than 2 — a 12.5 percent decrease in 
file size — and FKZIP was born. 
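
The parable's arithmetic is easy to reproduce. The following Python sketch uses the two code tables and the letter frequencies from the story; the helper names (average_bits, encode, decode) are illustrative. It also shows why the variable-length code can be decoded without separators: no code is the beginning of another.

```python
FSCII = {"a": "00", "e": "01", "i": "10", "o": "11"}
FKZIP = {"a": "0", "e": "10", "i": "110", "o": "111"}
FREQ = {"a": 0.5, "e": 0.25, "i": 0.125, "o": 0.125}


def average_bits(code, freq):
    """Expected number of bits per character under the given frequencies."""
    return sum(freq[ch] * len(bits) for ch, bits in code.items())


def encode(text, code):
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    """Read bits left to right, emitting a character whenever a code completes."""
    reverse = {v: k for k, v in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)


sample = "aaaaeeio"   # matches the 50/25/12.5/12.5 percent mix
print(average_bits(FSCII, FREQ), average_bits(FKZIP, FREQ))
print(len(encode(sample, FSCII)), len(encode(sample, FKZIP)))
```

The eight-character sample takes 16 bits in FSCII and 14 in the variable-length code, the same 12.5 percent saving.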

SHANNON'S THEOREM Could Foo Katz 
have done better than 1.75 bits per char- 
acter? Actually, given the probability of 
each character in Foobari, he couldn't. 
Work done in the late 1940s by Claude 
Shannon at Bell Labs shows this to be 
true. Shannon founded the science of in- 
formation theory and proved what has 
come to be called Shannon's theorem. 



In its simplest form Shannon's theo-
rem says that if you have n possible char-
acters occurring randomly but with non-
uniform probabilities p(1), p(2), ..., p(n),
the average number of bits you need to
code a character is

N = -(p(1)*log2 p(1) + p(2)*log2 p(2) + ... + p(n)*log2 p(n))

In an optimal coding scheme, then, the
number of bits needed to code each char-
acter is -log2p(j), where log2 is the loga-
rithm to the base 2. That is, character 1
requires -log2p(1) bits, character 2 re-
quires -log2p(2) bits, and so on. Shannon
says you cannot do any better than this.
Drawing an analogy with a similar-look-
ing quantity introduced by nineteenth-
century physicists studying statistical me-
chanics, Shannon called the quantity N
the entropy.

As a simple example, suppose all the 
p(j)'s are equal. If there are n characters, 
the probability of each is p(j) = 1/n. Since
-log2(1/n) is log2(n), we see that

N = (p(1) + p(2) + ... + p(n)) * log2(n)
  = log2(n)

This is because, by definition, the sum of 
all the probabilities of the individual 
characters must be 1. 
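
As a quick check of the equal-probability case, here is a small Python sketch of the entropy N. The function name entropy_bits is my own; the formula is the one given above.

```python
import math


def entropy_bits(probabilities):
    """Shannon's N: the average number of bits needed per character."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


for n in (256, 16, 200):
    uniform = [1 / n] * n          # n equally likely characters
    print(n, entropy_bits(uniform), math.log2(n))
```

For each n the two printed values agree (up to floating-point rounding), as the derivation says they should.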

The Pascal test program of Figure 1 
used characters chosen randomly from all 
256 possible values, and we saw that the 
resulting file could not be compressed at 
all. However, if you go back to that test 
program and change random(256) to ran- 
dom(16), the result will be different. 
When a number smaller than 256 is used,
the distribution of bytes is not random
among all possibilities, even though the
byte-to-byte changes are still random.

TESTFILE.PAS
Complete Listing

var                                 (* variable declarations *)
  i: word;
  b: byte;
  randfile: file of byte;
begin                               (* program start *)
  assign(randfile, 'foo.bar');      (* give the file the name foo.bar *)
  rewrite(randfile);                (* creates and opens file *)
  for i := 1 to 10000 do begin      (* start loop, execute 10,000 times *)
    b := random(256);               (* select byte randomly from 256 choices *)
    write(randfile, b);             (* write out the byte *)
  end;                              (* end loop *)
  close(randfile);                  (* close file *)
end.                                (* program end *)

Figure 1: Changing random(256) to random(16) in this program listing turns the resulting file from one
that cannot be compressed at all into one that can be compressed by 42 percent.

Run it and try PKZIP again — you'll get 
a 42 percent improvement in file size (us- 
ing PKZIP 2.0). The theoretical best you 
can hope for in the random(16) case is 
50 percent. This is because only four bits 
are needed to generate 16 different char- 
acters (N = log2(16) = 4), and four bits 
are half as much as eight bits. 

Now change the program line to ran- 
dom(200). When you run PKZIP again, 
you'll get a 4 percent improvement over 
the original. If n = 200, then Shannon says 
each character needs log2(200) = 7.644 
bits, or .9554 bytes. So the theoretical 
maximum compression savings is 4.5 per- 
cent for the random(200) case. This 
shows that even a slight departure from 
complete randomness will permit some 
compression. (Of course, it's not clear 
how you can use fractional bits to repre- 
sent something, but I'll get to that later.) 
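
If you don't have a Pascal compiler handy, the experiment can be approximated in Python. This is a sketch under stated assumptions: the standard zlib module is used in place of PKZIP (both rely on Deflate-style compression, but the container overhead and the exact sizes will differ from the figures quoted here), and random_file and shannon_bound_bytes are hypothetical helper names.

```python
import math
import random
import zlib


def random_file(limit, size=10_000, seed=1):
    """Bytes drawn uniformly from 0..limit-1, like random(limit) in Figure 1."""
    rng = random.Random(seed)
    return bytes(rng.randrange(limit) for _ in range(size))


def shannon_bound_bytes(limit, size=10_000):
    """Smallest possible size, in bytes, at log2(limit) bits per character."""
    return size * math.log2(limit) / 8


for limit in (256, 16, 200):
    data = random_file(limit)
    packed = zlib.compress(data, 9)
    print(f"random({limit}): {len(data)} bytes -> {len(packed)} bytes, "
          f"Shannon bound {shannon_bound_bytes(limit):.0f} bytes")
```

The compressed sizes should land above the Shannon bound in every case; how close they come depends on the compressor, not on the theorem.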

For the Foobari example, the p's are
1/2, 1/4, 1/8, and 1/8, and the negative logs are
1, 2, 3, and 3, exactly the number of bits
in the FKZIP code that Foo Katz in-
vented. The average number of bits
needed is

1/2 + (1/4)*2 + (1/8)*3 + (1/8)*3
= 1.75

Thus, the Foo Katz coding did realize the 
theoretical maximum. 

HUFFMAN ENCODING In 1952, Huffman 
came up with a simple algorithm that re- 
alizes the Shannon theoretical maximum 
when all the probabilities are negative 
powers of 2 (that is, powers of 1/2: 1/2, 1/4,
1/8, and so on). Indeed, even when the
probabilities are general, the Huffman al- 
gorithm works very well. It is also easy 
to implement in a program. Early ver- 
sions of ARC, the first widely popular 
compression scheme on PCs, used Huff- 
man encoding, and it remains one of the 
elements used in many lossless compres- 
sion schemes today. 

In Huffman encoding, a binary tree is 
used to assign codes. As depicted in Fig- 
ure 2, the tree has its root on the left with 
the branches pointing toward the right. 
There are some internal nodes, which 
have one line coming from the left and 
two going off to the right; there are also
some leaves, which
are nodes that 
have only a line 
from the left. The 
leaves are associ- 
ated with the initial 
set of characters; 
there is one leaf for 
each character. 

To form the 
tree, you first pick 
the two characters 
with the smallest 
probabilities and 
join them to a 
node. Give the 
node the combined 
probability and 
then look at that 
node and the re- 
maining charac- 
ters. Pick the two 
objects (nodes or 
unused characters) 
with the smallest 
probabilities, join 
them to a node, 
and then continue 
until all the charac- 
ters are in the tree. 

Once you have drawn the tree, you 
find the code for a leaf by starting at the 
root and recording the path to that leaf. 
All leaves on the top branch have a 0 for
the first digit; those on the bottom branch 
have a 1. The number of node levels 
needed corresponds to the number of bits 
needed to code the character. Thus the 
complete tree shown in Figure 2 leads to 
precisely the coding that Foo Katz used 
in the Foobari parable. 
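
The tree-building procedure translates almost directly into code. The Python sketch below is one way to do it, not a reference implementation: it keeps the pending nodes in a heap, and instead of drawing a tree it carries each node's partial codes along and prepends a bit at every join. Ties are broken by insertion order, an arbitrary choice of mine that happens to reproduce Foo Katz's table; in general only the code lengths are fixed by the algorithm.

```python
import heapq
from itertools import count


def huffman_code(freq):
    """Return {symbol: bit string} by repeatedly joining the two least probable items."""
    tiebreak = count()  # unique counter, so the heap never has to compare dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        (_, _, only), = heap
        return {sym: "0" for sym in only}
    while len(heap) > 1:
        p_first, _, first = heapq.heappop(heap)     # smallest probability
        p_second, _, second = heapq.heappop(heap)   # next smallest
        # Joining two items to a node adds one leading bit to every code below it.
        merged = {s: "0" + c for s, c in first.items()}
        merged.update({s: "1" + c for s, c in second.items()})
        heapq.heappush(heap, (p_first + p_second, next(tiebreak), merged))
    return heap[0][2]


foobari = {"a": 0.5, "e": 0.25, "i": 0.125, "o": 0.125}
print(huffman_code(foobari))
```

Run on the Foobari frequencies, this yields a=0, e=10, i=110, and o=111.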

In the random(200) example, no 
branch ends before 7 levels, so you will 
need a minimum of 7 bits to code each 
character. There are a full 128 nodes or 
leaves at level 7. Of these 128 items, 56 
are leaves and 72 are nodes that each con- 
nect to 2 leaves (56 + 2*72 = 200). Thus 
144 branches end at level 8. 

If every probability is a negative 
power of 2, the Huffman algorithm per- 
fectly realizes the theoretical maximum 
predicted by Shannon's theorem and is 
thus optimal. If some probabilities are 
not inverse powers of 2, Shannon's maxi- 
mum won't be realized, but the results are 
still pretty good. For example, in the random(200)
example, Huffman uses an average of 7.72 bits ((56*7 + 144*8)/200),
which isn't too much larger than the
Shannon limit of 7.644... bits.

[Figure 2: How the Huffman Algorithm Works. Panels: (a) title only partly legible ("... in language");
(b) after two leaves are joined to a node; (c) the tree is extended; (d) the Huffman tree and
its codes: a=0.5 (coded as 0), e=0.25 (coded as 10), i=0.125 (coded as 110), o=0.125 (coded as 111).
These tree diagrams show the successive application of the Huffman
encoding system to the Foobari language.]
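
The 56/144 split for random(200) can be confirmed by running Huffman's merge procedure on 200 equally weighted symbols. This Python sketch tracks only code lengths, not the codes themselves; huffman_lengths is an illustrative name.

```python
import heapq
from collections import Counter


def huffman_lengths(weights):
    """Code length of each symbol, found by Huffman's merge procedure."""
    # Heap items: (weight, tiebreak, indices of the symbols under this node)
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    tiebreak = len(weights)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1   # everything under the new node sits one level deeper
        heapq.heappush(heap, (w1 + w2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths


lengths = huffman_lengths([1] * 200)    # 200 equally likely byte values
histogram = Counter(lengths)
average = sum(lengths) / len(lengths)
print(dict(histogram), average)
```

The histogram shows 56 codes of 7 bits and 144 codes of 8 bits, for an average of 7.72 bits per character.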

An alternative tree-based procedure 
is called Fano-Shannon encoding. Huff- 
man starts with the leaves and combines 
them, building the tree from the leaves 
back toward the root. Fano-Shannon 
uses the opposite strategy: At each stage, 
it breaks the characters into two groups 
with roughly equal probabilities and 
builds the tree from the root toward the 
leaves. This method produces similar re- 
sults to Huffman encoding, but some pro- 
grammers may prefer implementing one 
over the other. 
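
For comparison, here is a minimal Python sketch of the top-down approach. The splitting rule used (cut the sorted list where the two halves are closest in total probability) is one common reading of "roughly equal," and the function name shannon_fano is my own.

```python
def shannon_fano(freq):
    """Return {symbol: bit string} by splitting into two groups of roughly equal probability."""
    symbols = sorted(freq, key=freq.get, reverse=True)   # most probable first
    codes = {}

    def split(group, prefix):
        if len(group) == 1:
            codes[group[0]] = prefix or "0"
            return
        total = sum(freq[s] for s in group)
        running, best_cut, best_gap = 0.0, 1, float("inf")
        for cut in range(1, len(group)):
            running += freq[group[cut - 1]]
            gap = abs(total - 2 * running)   # difference between the two halves
            if gap < best_gap:
                best_cut, best_gap = cut, gap
        split(group[:best_cut], prefix + "0")   # top group gets a 0
        split(group[best_cut:], prefix + "1")   # bottom group gets a 1

    split(symbols, "")
    return codes


print(shannon_fano({"a": 0.5, "e": 0.25, "i": 0.125, "o": 0.125}))
```

On the Foobari frequencies the result is the same table Huffman produces; with less tidy probabilities the two methods can differ slightly.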

SAVING UP FRACTIONAL BITS Suppose 
the Foobari passed a law that henceforth 
every e was to be removed and replaced 
by an a. In that case, text documents 
would have 12.5 percent each i's and o's, but
75 percent a's. Since -log2(.75) is 0.415, 
Shannon tells us that the optimum code 
length for a is less than half a bit. But how 
in the world can you code a character in 
less than one bit?

Remarkably, there is an elegant
method that both effectively uses frac- 
tional bits and is easy to implement in al- 
gorithms. It is known as arithmetic encod- 
ing. Unfortunately, it has not yet found 
much practical use. One reason is that it 
was discovered rather recently — in the 
late 1970s — by several different com- 
puter scientists at about the same time. 
Thus it lacks the mnemonic value of a sin- 
gle name. Moreover, arithmetic encoding 
was not discussed in many of the books 
written thereafter, so many programmers 
have missed it. Finally, some of the early 
work was done at IBM, and IBM pa- 
tented a specific algorithm implementing 
the idea. Although the idea itself cannot 
be patented, some programmers who 
might have stumbled across references to 
arithmetic encoding in the literature
may also have heard "IBM" and "pa- 
tent" and decided to go elsewhere. 

Arithmetic encoding works best when 
character probabilities are far from 
equal. The revised Foobari language, in 
which the characters are in a ratio of 6:1:1, 
is a good example. To apply arithmetic 
encoding, you should start by thinking of 
the file to be compressed as a character 
string — for example, aio. You need to 
associate this string with a real number 
between 0 and 1. To do this, code the
string one character at a time, using the 
probability of that character to home in 
on a real number that will represent the 
string. This real number is not stored as 
a fixed precision floating-point data type; 
it's stored as a base 2 number represented 
as a string of bits. 

As we go through an example here, 
it may help to imagine a ruler whose sub- 
divisions are marked off in sections pro- 
portional to the probabilities of all the 
characters to be used. In Foobari, then, 
we'd mark off 3/4 for a, and then 1/8 each for i
and o. We'll call the real number that will
encode the aio string r. 

Since the first character is a, the final
code for our string must lie inside the a
range, between 0 and 0.75. Now to code
the i, which follows the a, we turn our at-
tention to the a section of the ruler. We
take that section, and divide it up propor-
tionally (3/4, 1/8, 1/8), just as we did when
coding the a. We've narrowed the range
for our final code down to the i range:
0.5625 to 0.65625. To code the o that fol-
lows the i, we turn our attention to the
i section, and again divide it up propor-
tionally and select the o range of this sec-
tion. Taking the middle value of this
range, we can say that aio is encoded as
r = 0.650390625.
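
The ruler procedure can be followed step by step in code. This Python sketch uses exact fractions so the interval endpoints match the decimal values in the text; it is a teaching model of the idea (the names MODEL, narrow, and decode are mine), not a practical arithmetic coder, which would work with bit strings of limited precision.

```python
from fractions import Fraction

# Revised Foobari model from the text: a = 3/4, i = 1/8, o = 1/8.
MODEL = [("a", Fraction(3, 4)), ("i", Fraction(1, 8)), ("o", Fraction(1, 8))]


def narrow(text):
    """Return the (low, high) section of the ruler that represents `text`."""
    low, width = Fraction(0), Fraction(1)
    for ch in text:
        offset = Fraction(0)
        for symbol, p in MODEL:
            if symbol == ch:
                low += width * offset    # move to the start of this character's section
                width *= p               # and shrink to that section's size
                break
            offset += p
        else:
            raise ValueError(f"{ch!r} is not in the model")
    return low, low + width


def decode(value, length):
    """Recover `length` characters from a number inside the final section."""
    out, low, width = [], Fraction(0), Fraction(1)
    for _ in range(length):
        offset = Fraction(0)
        for symbol, p in MODEL:
            if value < low + width * (offset + p):
                out.append(symbol)
                low += width * offset
                width *= p
                break
            offset += p
        else:
            raise ValueError("value lies outside the ruler")
    return "".join(out)


low, high = narrow("aio")
r = (low + high) / 2
print(float(low), float(high), float(r), decode(r, 3))
```

The final section runs from 0.64453125 to 0.65625, its midpoint is 0.650390625, and decoding that number for three characters gives back aio.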

For longer strings, each successive 
character is encoded in the same way. As 
you encode more characters into a single 
real number, the range you're using gets 
smaller and smaller so the number of sig-
nificant digits needed gets larger. Encod-
ing the more probable characters shrinks
the range less. For example, encoding an
a shrinks the range to 3/4 of what it was,
but encoding an i shrinks the range to 1/8.
Arithmetic compression is more effective 
for probable characters because you can 
encode more of them in a given number 
of significant digits. 
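
To put numbers on this, the Python sketch below computes the final range width for a string and the number of binary digits that are enough to name some point inside a range that wide. The bound used, the ceiling of -log2(width), is my assumption about how to count digits; a real coder needs a little extra to mark where the number ends.

```python
import math
from fractions import Fraction

P = {"a": Fraction(3, 4), "i": Fraction(1, 8), "o": Fraction(1, 8)}


def range_width(text):
    """Width of the final range after encoding every character of `text`."""
    width = Fraction(1)
    for ch in text:
        width *= P[ch]
    return width


def bits_needed(text):
    """Binary digits sufficient to name a point inside a range of this width."""
    # Multiples of 2**-k are spaced 2**-k apart, so any range at least
    # that wide is guaranteed to contain one of them.
    return math.ceil(-math.log2(float(range_width(text))))


print(bits_needed("aaaaaaaa"), bits_needed("iiiiiiii"))
```

Eight a's fit in 4 bits, or half a bit each, while eight i's need 24 bits, or 3 bits each.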

USING MULTIBYTE BLOCKS Most com- 
mercially available compression pro- 
grams don't limit themselves to a single 
method of compressing files; they make 
use of a variety of techniques. All the 
techniques we've discussed so far are ex- 
amples of static, probabilistic encoding 
with single bytes. They are probabilistic 
because they use the likelihood of a char- 
acter to determine the number of bits 
needed to code the character. They're 
static because they assume an a priori set 
of character probabilities and thus a fixed 
code. Current lossless compression tech- 
nology goes beyond this kind of encoding 
in three ways. It uses 

• building blocks of more than one char- 
acter, 

• codes tailored to the file (dynamic en- 
coding), and 

• pointers to encode redundant parts of 
the file (Lempel-Ziv). 

To determine how much redundancy
of information is in the English language,
Shannon asked people to guess words
from their first few letters. On this basis
(and with some further analysis), com-
puter scientists have come to believe that
text files require a certain minimal num-
ber of bits per letter. Typical estimates
lie in the 1.4- to 1.6-bit range, which may
seem rather surprising since there are 26 
letters in the alphabet. Still, based on this 
estimate, text files can't be compressed 
by more than a factor of 5 or 6 from the 
usual 8 bits per character. 

Text can be compressed this far
because there is much more structure to the
English language than just the fact that
e is more common than q. There is a ten- 
dency for t to be followed by h and then 
by e, for example, and a q is almost always 
followed by a u. This suggests that instead 
of using single bytes as the objects to be 
encoded, it can be useful to deal with mul- 
tiple-byte blocks. 

One way to do that is through run 
length encoding (RLE), which records a 
character and then the number of times 
it occurs in a string. RLE works well when 
there are long strings of a single repeated 
character, as occurs in some EXE files 
with preinitialized data, in some graphics 
files with long runs of pure white or black, 
and in database files with padded fields. 
It is rarely the best method for most files, 
however, because this situation is rela- 
tively uncommon. Indeed, Version 2.0 of 
PKZIP no longer includes an RLE 
method. 
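
A minimal Python sketch of RLE follows. The layout chosen here (one byte for the character, one byte for the count, runs capped at 255) is an assumption for illustration; real formats differ in the details.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode as (byte value, run length) pairs; runs are capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((data[i], run))
        i += run
    return bytes(out)


def rle_decode(blob: bytes) -> bytes:
    """Expand each (byte value, run length) pair back into a run."""
    out = bytearray()
    for value, run in zip(blob[0::2], blob[1::2]):
        out += bytes([value]) * run
    return bytes(out)


padded = b"\x00" * 300 + b"AB"      # long run, as in a padded database field
print(len(padded), len(rle_encode(padded)))
print(len(b"ABCD"), len(rle_encode(b"ABCD")))
```

The padded field shrinks from 302 bytes to 8, but text with no runs doubles in size, which is why RLE alone is rarely the best choice.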

Another approach is a dictionary- 
based system. You might, for example, 
build a dictionary composed of the 65,280 
most common words and phrases in the 
English language. (65,280 is 64K— 64* 
1,024 — minus 256; you need to reserve 
256 entries for the ASCII codes for letters 
so you can handle words that are not in 
the dictionary.) You'd then encode a file 
in 2-byte pieces (16 bits are needed to ad- 
dress 65,536 different possibilities), 
where each 2-byte code corresponds to 
an entry in the dictionary. For words that 
are not found in the dictionary, this 
method will double the size of the space 
that is needed since 2 bytes must be used 
in order to encode each original byte. On 
the other hand, the encoding would re- 
place many words of 5 or 6 or even more
bytes with codes of only 2 bytes.

Since the 64K dictionary possibilities
do not occur with equal probability in 
files to be compressed, you could im- 
prove on this scheme by using Huffman 
or arithmetic encoding to further com- 
press the first iteration of encoded text. 
In the first pass you'd encode the text by 
representing it in terms of 2-byte codes, 
as described above. In the next step, the 
most common codes would be repre- 
sented by short bit-strings while the un- 
common codes are mapped to longer bit- 
strings, causing further reduction in size. 
This encoding method was used in a
program called Stomp, which was popu-
lar on BBSs several years ago. Stomp
never caught on because its method is in-
herently slow and has some additional
limitations. It's slow because of the large
number of possibilities that must be 
scanned. It's also limited to text files and 
requires a large data file (the dictionary) 
to be on hand. Still, the method is a likely 
candidate for the tightest compression of 
text. 
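
The first pass of such a scheme can be sketched in a few lines of Python. The four-entry word list below is a toy stand-in for the 65,280-entry dictionary, and the names WORDS, encode, and decode are hypothetical; this is not how Stomp itself was written. Codes 0 through 255 are reserved for single bytes, as described above.

```python
import struct

WORDS = ["the ", "and ", "compression ", "file "]      # toy dictionary
CODE_OF = {w: 256 + i for i, w in enumerate(WORDS)}    # 0..255 mean "literal byte"


def encode(text: str) -> bytes:
    """Replace dictionary entries with 2-byte codes; other bytes also cost 2 bytes."""
    raw = text.encode("ascii")
    codes, i = [], 0
    while i < len(raw):
        for word in sorted(WORDS, key=len, reverse=True):   # prefer the longest entry
            if raw.startswith(word.encode("ascii"), i):
                codes.append(CODE_OF[word])
                i += len(word)
                break
        else:
            codes.append(raw[i])      # not in the dictionary: 2 bytes for 1
            i += 1
    return struct.pack(f">{len(codes)}H", *codes)


def decode(blob: bytes) -> str:
    codes = struct.unpack(f">{len(blob) // 2}H", blob)
    return "".join(chr(c) if c < 256 else WORDS[c - 256] for c in codes)


sample = "the file and the compression "
print(len(sample), len(encode(sample)), len(encode("xyz")))
```

The 29-character sample becomes 10 bytes, while the three letters of "xyz" grow to 6.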

DYNAMIC ENCODING Everything I've 
discussed so far sets up a fixed encoding 
mechanism for all files. But a set of prob- 
abilities that's good for text will be bad 
for database files, and a set of probabili- 
ties that works for database files will be 
bad for EXE or WK1 files. Clearly, it 
would be best to tailor the probabilities 
to the file. That's in reality what every 
commercial compression package does. 

The simplest way to implement file- 
based encoding is to compute the proba- 
bility distribution of the bytes used in the 
file and use those distributions as the ba- 
sis for Huffman or Fano-Shannon encod- 
ing. However, this method requires that 
the file be read twice. That means it can-
not be used for real-time compression ap-
plications such as modem communica-
tions or hard disk backup to tape.

To meet the demands of real-time
compression there are methods that re-
compute probabilities and codes as they
go along. The probabilities encountered 
in the first part of the file are used to opti- 
mally encode the later parts. For exam- 
ple, the algorithm might start with an a 
priori distribution, and then every 256 
bytes recompute the distribution based 
on what has gone before. These methods 
are called dynamic or adaptive encoding. 

You might think that since the codes 
keep changing, dynamic encoding sys- 
tems would present problems for decom- 
pression. Fortunately, however, the de- 
compressor can always reconstruct the 
method the compressor used. This is be- 
cause at any stage the decompressor has 
exactly the same information available to 
it that the compressor had at that stage — 
namely, the part of the file that has been 
uncompressed up to that point. 
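
The following Python sketch shows the synchronization idea in its simplest form. It is a deliberately simplified stand-in for adaptive Huffman coding: instead of bit codes it emits each byte's rank in a table ordered by the counts seen so far, and it updates the table after every byte rather than every 256. The class and function names are mine. The point is that the decoder rebuilds exactly the same table from the bytes it has already recovered.

```python
class AdaptiveRanker:
    """Orders byte values by how often they have been seen so far."""

    def __init__(self):
        self.counts = [0] * 256
        self.order = list(range(256))     # a priori order: 0, 1, 2, ...

    def update(self, byte):
        self.counts[byte] += 1
        # Most frequent first; ties broken by byte value so both sides agree.
        self.order.sort(key=lambda b: (-self.counts[b], b))


def encode(data: bytes) -> list:
    model, ranks = AdaptiveRanker(), []
    for byte in data:
        ranks.append(model.order.index(byte))   # depends only on earlier bytes
        model.update(byte)
    return ranks


def decode(ranks) -> bytes:
    model, out = AdaptiveRanker(), bytearray()
    for rank in ranks:
        byte = model.order[rank]
        out.append(byte)
        model.update(byte)                      # the same update the encoder made
    return bytes(out)


print(encode(b"aaab"))
print(decode(encode(b"adaptive encoding adapts")))
```

Once a byte has been seen a few times its rank drops toward 0, and a variable-length code applied to the ranks would then give it a short code.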

Behind much modern lossless com- 
pression is an idea introduced by Lempel 
and Ziv in 1977-78. This idea has been 
implemented with a number of varia- 
tions, especially in one with some contri- 
butions by Welch, so you'll see references 
to the method as LZ or LZW. The basic 
idea is to use the file itself as a sliding 
dictionary as it is processed. As the com- 
pressing program is about to compress 
the next n bytes of the file, it looks at what 
it has already processed and sees if the 
first x characters of those bytes have 
occurred in the past y bytes. If so, it stores 
two pieces of information: a pointer to 
where the identical string occurred and 
the number of characters in the string. 

For example, the algorithm might 
look back at the past 1,024 characters for 
an identical string with a length of 16 or 
fewer characters. In this example, 10 bits 
are needed to store the number from 1 
to 1,024 that tells how far back the string 
was, and 4 bits are then needed to store 
the length of the string. Thus, the total 
number of bits needed is 14. This con- 
trasts very favorably with the maximum 
of 128 bits that would be needed to repre- 
sent a 16-byte string— it's a compression 
factor of more than 9. 
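
Here is a Python sketch of the sliding-dictionary idea with the numbers from this example: a 1,024-byte window and matches of up to 16 bytes. It produces a list of tokens rather than packed bits, uses a slow brute-force search, and adds one assumption of my own, a minimum match length of 3, below which a byte is kept as a literal. A real coder would pack distance-1 into 10 bits and length-1 into 4.

```python
WINDOW = 1024     # look back at most this many bytes (fits in 10 bits)
MAX_LEN = 16      # longest match recorded (fits in 4 bits)
MIN_LEN = 3       # assumption: shorter matches are kept as literals


def lz_encode(data: bytes):
    """Return tokens: ('lit', byte) or ('ptr', distance back, length)."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for start in range(max(0, i - WINDOW), i):
            length = 0
            while (length < MAX_LEN and i + length < len(data)
                   and data[start + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - start
        if best_len >= MIN_LEN:
            tokens.append(("ptr", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens


def lz_decode(tokens) -> bytes:
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            _, dist, length = token
            for _ in range(length):
                out.append(out[-dist])   # byte by byte, so overlapping copies work
    return bytes(out)


sample = b"abcabcabcabcabcabcabc"
print(lz_encode(sample))
```

After three literals, the next 16 bytes of the sample are covered by a single pointer three bytes back.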

It is a remarkable fact that there are 
enough repetitions in most files that this 
method is often very effective. Some- 
times supplemented by a probabilistic 
method, it is the basis of many types of 
compression programs, including disk
compressors (for example, Stacker and
DOS 6's DoubleSpace), archivers (for ex- 
ample, PKZIP, ARC, and LHA), and 
tape backup programs. 

USEFUL REDUNDANT INFORMATION One 
reason compression works is that ASCII- 
encoded files have redundant informa- 
tion. Once redundancy is removed, how- 
ever, changing a few bits can change the 
original file dramatically. For example, 
changing the ten-bit pointer in the LZ en- 
coding example above could change the 
decompressed file by 16 bytes! And 
changing the four-bit length component 
can affect the meaning of all future point-
ers and so change the file even more!

For this reason, after squeezing out re- 
dundant information, a good compres- 
sion program puts some redundancy back 
in the form of a checksum, such as a CRC 
(cyclic redundancy check). This lets the 
program warn you if the file has been cor- 
rupted in any way. The compression algo- 
rithm squeezes out a lot of useless redun- 
dancy and then puts back some useful 
redundancy. 
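
A checksum of this kind takes only a few lines. The Python sketch below uses the CRC-32 function from the standard zlib module; the layout (a 4-byte CRC appended to the data) and the names seal and check are assumptions for illustration, not the format of any particular archiver.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(blob: bytes) -> bytes:
    """Return the payload if its CRC-32 matches; otherwise raise ValueError."""
    payload, stored = blob[:-4], int.from_bytes(blob[-4:], "big")
    if zlib.crc32(payload) != stored:
        raise ValueError("CRC mismatch: the data has been corrupted")
    return payload


sealed = seal(b"compressed data goes here")
damaged = bytes([sealed[0] ^ 0x01]) + sealed[1:]   # flip a single bit

print(check(sealed))
try:
    check(damaged)
except ValueError as err:
    print(err)
```

Four extra bytes are enough to catch the single flipped bit.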

In sum, the two most important tech- 
niques in lossless compression are using 
the nonrandom distribution of bytes 
within files to determine variable length 
coding schemes; and using the fact that 
there are often long repetitions of strings 
within a file to strip out redundancy. The 
first is the basis of Huffman and arithme- 
tic coding; the second is the basis of LZ 
and LZW. 

For further reading, I recommend you 
consider four books. The first two are ac- 
ademic computer science books, and 
while they are somewhat technical and 
dry, they also contain a great deal of valu- 
able information. These are Bell, Cleary, 
and Witten, Text Compression, Prentice 
Hall, 1990, ISBN 0-13-911991-4; and 
Storer, Data Compression, Computer 
Science Press, 1988, ISBN 0-7167-8156-5.
For a book written more for program- 
mers, see Nelson, The Data Compression 
Book, M&T Press, 1992, ISBN 1-55851- 
214-4. Finally, for some fascinating re- 
flections on information and data com- 
pression, take a look at Lucky, Silicon 
Dreams, St. Martin's Press, 1989, ISBN 
0-312-02960-8.